
Poster: brewster Date: Dec 28, 2004 2:03am
Forum: texts Subject: Ideas to help proofreading?

While the scanners and QA folks are good, they sometimes miss a page (as was reported recently on a book in the Canadian Libraries collection).

We would like to figure out how to help the proofreading/QA process (without too much of a burden on the web engineers here).

So we would like to solicit ideas on how we can leverage internal interns as well as volunteers out in web-land.

Ideally bad scans would be caught within a couple of days of posting so that the book can be relocated.

Some of the tools we currently have in place are:
* a per-book error-reporting system; see the bottom left of each details page:
http://www.archive.org/texts/texts-details-db.php?collection=toronto&collectionid=englishc00caltuoft&from=mostViewed

* an error-correcting system leveraging curators with different privileges:
http://www.archive.org/item-reports.php

* cool graphs of progress (see the bottom):
http://www.archive.org/about/graphs.php

What we don't have is a proofreading queue like the one Distributed Proofreaders has. We can hold things up in a curation phase so that they are proofed before being made live, but this would hold up the books.

If folks have easy-to-implement ideas, we would love to hear them.

-brewster


Poster: Branko Collin Date: Feb 22, 2005 7:56am
Forum: texts Subject: Re: Ideas to help proofreading?

Distributed Proofreaders have developed a number of tools to help us proofread. Although they are likely to be only really useful to DP (the earliest of those started out as tools to check if a text conformed to PG's formatting guidelines), perhaps one of your programmers could look at their source and see if any general rules can be gleaned from them.

Especially useful may be our pre-processing tools, as they try and catch some of the commonest problems.

We search for scanning errors using spell checkers, and for the ones that are valid English words using lists of "stealth scannos". For instance, "and" is commonly mis-OCR-ed as "arid". (Similar lists exist for languages other than English.) This method would probably be too time-consuming for TIA, but you could construct a tool that will find spelling errors that are commonly produced by OCR software. "tbe" for "the", for instance.
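As a rough illustration of the kind of tool described above, here is a minimal Python sketch. It is not DP's actual code: the function name find_suspects and the two tiny word lists are hypothetical, seeded only with the examples from this post ("tbe" for "the", "arid" for "and"). A real tool would load much longer lists and would still need a human to confirm each hit, since "arid" is sometimes the correct word.

```python
import re

# Hypothetical sample lists, seeded only with the examples from the post.
# Real lists would be far longer and maintained separately.
COMMON_OCR_MISSPELLINGS = {"tbe": "the"}   # not valid English words
STEALTH_SCANNOS = {"arid": "and"}          # valid words, so a spell checker misses them

WORD_RE = re.compile(r"[A-Za-z]+")


def find_suspects(text):
    """Return (line_number, word, suggestion, kind) for each suspect word."""
    suspects = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in WORD_RE.finditer(line):
            word = match.group(0)
            key = word.lower()
            if key in COMMON_OCR_MISSPELLINGS:
                suspects.append(
                    (line_number, word, COMMON_OCR_MISSPELLINGS[key], "misspelling")
                )
            elif key in STEALTH_SCANNOS:
                suspects.append(
                    (line_number, word, STEALTH_SCANNOS[key], "stealth")
                )
    return suspects


if __name__ == "__main__":
    sample = "He saw tbe hills\narid the sea beyond."
    for line_number, word, suggestion, kind in find_suspects(sample):
        print(f"line {line_number}: '{word}' -> '{suggestion}'? ({kind})")
```

The sketch only flags candidates with their line numbers and never rewrites the text, which keeps it from introducing new errors on its own.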

We also have several anecdotes about text that looks like an error to everything except a human proofreader, so how far you want to take automation also depends on how many errors you are willing to introduce.

If you need one of your interns to actually look at a text, our special proofing font may help; it's ugly as sin, but helps errors really stand out.

Not all of our tools are available through Sourceforge; our Help pages link to them, though.

When you start working with volunteers, try and make it as easy as possible for them to contribute.


Poster: Jon Noring Date: Jan 9, 2005 2:59am
Forum: texts Subject: Re: Ideas to help proofreading?

Hi Brewster,

In a private email I sent you a few days ago I mentioned the idea of working with Distributed Proofreaders to create two new volunteer-driven organizations: "Distributed Scanners" and "Distributed Catalogers" (DistCat). DistCat would be formed to mobilize professional librarian volunteers to assist with entering (and cleaning up) the cataloging/metadata associated with the scanning projects. DistCat could also be involved with QA'ing the scans to make sure everything is there and the scan quality is sufficient -- for this DistCat could mobilize volunteers (besides professional librarians) to assist with those particular tasks. DistCat might also oversee copyright clearance (so maybe another name besides DistCat would be more appropriate.)

There are certainly variants on this idea, so what I propose is simply a starting point for further discussion.

In general, DP has shown the way to mobilize volunteers and provide useful online tools to make the volunteers' jobs a lot easier, whatever they may be. It seems like the whole process of distributed scanning (as a parallel project to the formalized projects such as the Canadian scanning project) and metadata entry can use a similar approach. With this modularization, each component can optimize its own area of focus -- for example, DP can now focus more on the proofing process to produce structured digital text, rather than being involved with scan acquisition, metadata/cataloging, copyright clearance, etc. Of course, these different distributed projects would work closely with each other, and may even be governed by a common board.